Trajectory Sampling Value Iteration: Improved Dyna Search for MDPs

Authors

  • Yicheng Zhou
  • Quan Liu
  • Qi-ming Fu
  • Zongzhang Zhang
Abstract

Traditional online learning algorithms often suffer from slow convergence and limited accuracy. The Dyna-2 framework, which combines learning with search, offers a way to alleviate this problem: it executes a simulation-based search that helps the learning process select better actions. The search relies on a model of the environment that is built during learning. However, Dyna-2 does not fully exploit this model. To improve solution quality, this paper applies value iteration, a model-based dynamic programming algorithm, to the search process using a trajectory sampling approach (DynaTSVI). Trajectory sampling reduces the high time complexity that dynamic programming would otherwise incur. We analyze the proposed method experimentally on the Dyna Maze and Windy Grid World tasks. Our results show that DynaTSVI outperforms Dyna-2 in both deterministic and stochastic environments in terms of convergence rate and accuracy.
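The core idea of trajectory sampling value iteration is to restrict Bellman optimality backups to states visited along trajectories simulated from the learned model, rather than sweeping the entire state space. The following is a minimal sketch of that idea under assumed conventions, not the authors' implementation: `P[s][a]` is taken to be a list of `(next_state, probability)` pairs from the learned model, `R[s][a]` an expected reward, and `V` a tabular value estimate; all names and parameters are illustrative.

```python
import random

def trajectory_sampling_value_iteration(P, R, V, start_state, gamma=0.95,
                                         n_trajectories=10, max_depth=50,
                                         epsilon=0.1):
    """Sketch: value iteration backups applied only to states visited
    along trajectories simulated from a learned tabular model.

    P[s][a] -> list of (next_state, probability); R[s][a] -> expected reward.
    V is a dict of value estimates, updated in place.
    """
    def backup(s):
        # Full Bellman optimality backup over the learned model.
        return max(R[s][a] + gamma * sum(p * V.get(s2, 0.0)
                                         for s2, p in P[s][a])
                   for a in P[s])

    def greedy_action(s):
        return max(P[s], key=lambda a: R[s][a] + gamma * sum(
            p * V.get(s2, 0.0) for s2, p in P[s][a]))

    for _ in range(n_trajectories):
        s = start_state
        for _ in range(max_depth):
            if s not in P or not P[s]:          # unmodeled or terminal state
                break
            V[s] = backup(s)                    # back up only visited states
            # epsilon-greedy action choice keeps the simulated search exploratory
            a = (random.choice(list(P[s])) if random.random() < epsilon
                 else greedy_action(s))
            # sample the next state from the learned transition model
            next_states, probs = zip(*P[s][a])
            s = random.choices(next_states, weights=probs)[0]
    return V
```

Compared with a full value-iteration sweep, this only updates states the simulated trajectories actually reach, which is what keeps the per-decision search cost bounded; the trade-off is that rarely visited states retain stale value estimates.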


Related articles

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning

We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution pM(·). The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite length trajectory. We propose an approach for solving...


State Aggregation in Monte Carlo Tree Search

Monte Carlo tree search (MCTS) algorithms are a popular approach to online decision-making in Markov decision processes (MDPs). These algorithms can, however, perform poorly in MDPs with high stochastic branching factors. In this paper, we study state aggregation as a way of reducing stochastic branching in tree search. Prior work has studied formal properties of MDP state aggregation in the co...


Heuristic Search for Generalized Stochastic Shortest Path MDPs

Research in efficient methods for solving infinite-horizon MDPs has so far concentrated primarily on discounted MDPs and the more general stochastic shortest path problems (SSPs). These are MDPs with 1) an optimal value function V∗ that is the unique solution of the Bellman equation and 2) optimal policies that are the greedy policies w.r.t. V∗. This paper's main contribution is the description o...


Solving Large MDPs Quickly with Partitioned Value Iteration

Value iteration is not typically considered a viable algorithm for solving large-scale MDPs because it converges too slowly. However, the performance of value iteration can be dramatically improved by eliminating redundant or useless backups, and by backing up states in the right order. We present several methods designed to help structure value dependency, and present a systematic study of com...


A Tutorial on Linear Function Approximators for Dynamic Programming and Reinforcement Learning

A Markov Decision Process (MDP) is a natural framework for formulating sequential decision-making problems under uncertainty. In recent years, researchers have greatly advanced algorithms for learning and acting in MDPs. This article reviews such algorithms, beginning with well-known dynamic programming methods for solving MDPs such as policy iteration and value iteration, then describes approx...




Publication date: 2015